The IMDb Scraper uses Python’s concurrent.futures.ThreadPoolExecutor to parallelize movie detail extraction, significantly improving scraping performance while maintaining stability.
ThreadPoolExecutor Architecture
The scraper implements thread-based parallelism at two levels:
- Movie detail fetching: Parallel HTTP requests for movie pages
- Dual persistence: Concurrent CSV and PostgreSQL writes
Movie Detail Fetching
Implementation
The main scraping loop uses ThreadPoolExecutor to process multiple movies simultaneously (infrastructure/scraper/imdb_scraper.py:40).
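The original listing is not reproduced here, so the following is only a minimal sketch of how such a loop is commonly written with submit and as_completed. The names scrape_movie_detail, scrape_all, and MAX_THREADS are hypothetical stand-ins, and the worker returns canned data instead of making an HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_THREADS = 50  # assumed default, matching the thread count stated in this document


def scrape_movie_detail(movie_id: str) -> dict:
    """Hypothetical worker; a real one would fetch and parse the movie page."""
    return {"id": movie_id, "title": f"Movie {movie_id}"}


def scrape_all(movie_ids: list[str], max_workers: int = MAX_THREADS) -> list[dict]:
    """Run one worker task per movie and collect the results."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one task per movie and remember which future belongs to which ID.
        future_to_id = {
            executor.submit(scrape_movie_detail, movie_id): movie_id
            for movie_id in movie_ids
        }
        # as_completed yields futures as they finish, not in submission order.
        for future in as_completed(future_to_id):
            results.append(future.result())
    return results
```

Because results arrive in completion order, callers that need a stable order should sort them or key them by movie ID.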
Worker Function
Each thread executes the _scrape_and_save_movie_detail method (infrastructure/scraper/imdb_scraper.py:56).
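The method body is not shown in this page. As a rough sketch, a worker of this kind usually fetches one movie, persists it, and contains its own failures. The signature below, including the injected fetch_detail and save_movie callables, is an assumption made so the example runs on its own; it is not the project's actual method.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("imdb_scraper_sketch")


def scrape_and_save_movie_detail(
    movie_id: str,
    fetch_detail: Callable[[str], dict],
    save_movie: Callable[[dict], None],
) -> Optional[dict]:
    """Fetch one movie and persist it; return None instead of raising on failure."""
    try:
        movie = fetch_detail(movie_id)
        save_movie(movie)
        return movie
    except Exception:
        # Log with traceback and carry on; one bad page should not stop the pool.
        logger.exception("Failed to process movie %s", movie_id)
        return None
```

Returning None on failure keeps the calling loop simple, at the cost of hiding the exception type from the caller.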
Thread Pool Configuration
Configurable Thread Count
The number of concurrent threads is configurable via config.py (shared/config/config.py:53).
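How config.py exposes the value is not shown here. One common pattern, sketched below, reads the thread count from an environment variable with a safe fallback. The variable name MAX_THREADS and the helper get_max_threads are hypothetical.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_THREADS = 50  # default stated in this document


def get_max_threads(env: Optional[Mapping[str, str]] = None) -> int:
    """Read MAX_THREADS (hypothetical name), falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("MAX_THREADS", "")
    try:
        value = int(raw)
    except ValueError:
        # Missing or non-numeric values fall back to the default.
        return DEFAULT_MAX_THREADS
    # A pool needs at least one worker, so reject zero and negative values.
    return value if value > 0 else DEFAULT_MAX_THREADS
```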
Optimal Thread Count
The default configuration uses 50 threads, balancing:
- Performance: Parallel HTTP requests reduce total scraping time
- Resource usage: Prevents overwhelming the network or target server
- Rate limiting: Stays within acceptable request rates
Persistence Concurrency
Composite Use Case with ThreadPoolExecutor
The persistence layer also uses threads to write to CSV and PostgreSQL simultaneously (application/use_cases/composite_save_movie_with_actors_use_case.py:25).
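A composite use case of this kind can be sketched as a small class that runs each backend's save call on its own thread. The class name CompositeSaveSketch and its constructor are assumptions for illustration, and plain callables stand in for the CSV and PostgreSQL use cases.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


class CompositeSaveSketch:
    """Runs every save callable for the same movie on its own thread."""

    def __init__(self, savers: Sequence[Callable[[dict], None]]):
        self._savers = list(savers)

    def execute(self, movie: dict) -> None:
        # max_workers must be positive, even if no savers were configured.
        workers = max(1, len(self._savers))
        with ThreadPoolExecutor(max_workers=workers) as executor:
            futures = [executor.submit(save, movie) for save in self._savers]
            # result() re-raises a backend's exception, so failures are not silent.
            for future in futures:
                future.result()
```

With two backends, the total save time approaches that of the slower backend rather than the sum of both.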
Parallel Persistence Strategies
When a movie is scraped, it’s saved to both backends concurrently.
Thread Safety
CSV Thread Safety
The CSV repository uses threading locks to prevent race conditions (infrastructure/persistence/csv/repositories/movie_csv_repository.py:39).
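The locking pattern can be sketched as follows. This is a simplified stand-in, not the repository's code: the class name MovieCsvWriterSketch and its two-column row layout are assumptions.

```python
import csv
import threading
from pathlib import Path


class MovieCsvWriterSketch:
    """Appends rows to a CSV file, allowing one writer thread at a time."""

    def __init__(self, path: str):
        self._path = Path(path)
        self._lock = threading.Lock()

    def save(self, movie_id: str, title: str) -> None:
        # Only the thread holding the lock may open and append to the file.
        with self._lock:
            with self._path.open("a", newline="", encoding="utf-8") as handle:
                csv.writer(handle).writerow([movie_id, title])
```

The lock only protects writers that share the same instance; separate instances pointing at the same file would need a shared lock.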
PostgreSQL Thread Safety
PostgreSQL handles concurrent writes through connection pooling and transaction isolation (infrastructure/persistence/postgres/repositories/movie_postgres_repository.py:21).
Performance Benefits
Sequential vs Parallel Execution
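Sequential scraping (1 thread) processes one movie at a time, so each request waits for the previous one to complete. Parallel scraping overlaps those waits across threads.
Real-World Performance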
With network latency and retry logic:
- Sequential: ~15-20 minutes for 250 movies
- Parallel (50 threads): ~2-3 minutes for 250 movies
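The gap between those figures comes from overlapping network waits. The sketch below reproduces the effect with a simulated delay in place of real HTTP requests; fake_fetch and SIMULATED_LATENCY are hypothetical, and the timings it produces say nothing about IMDb itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_LATENCY = 0.05  # seconds; stand-in for one HTTP round trip


def fake_fetch(movie_id: int) -> int:
    # Sleeping releases the GIL, much like waiting on a socket does.
    time.sleep(SIMULATED_LATENCY)
    return movie_id


def run_sequential(ids: range) -> float:
    """Return the elapsed seconds when fetching one ID after another."""
    start = time.perf_counter()
    for movie_id in ids:
        fake_fetch(movie_id)
    return time.perf_counter() - start


def run_parallel(ids: range, max_workers: int) -> float:
    """Return the elapsed seconds when fetching through a thread pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(fake_fetch, ids))
    return time.perf_counter() - start
```

Comparing run_sequential(range(20)) with run_parallel(range(20), max_workers=10) shows the parallel run finishing sooner, since the sleeps overlap instead of adding up.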
Resource Management
Automatic Cleanup
ThreadPoolExecutor automatically manages thread lifecycle: when used as a context manager, it waits for pending tasks and releases its threads on exit.
Error Isolation
Each thread handles its own errors without affecting other threads.
Execution Flow
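The flow from submission to cleanup, including error isolation, can be sketched end to end. The worker process_movie and the failing ID "tt_bad" are hypothetical; they exist only to show that one failed task leaves the others untouched.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_movie(movie_id: str) -> str:
    """Hypothetical worker that fails for one ID to demonstrate isolation."""
    if movie_id == "tt_bad":
        raise RuntimeError(f"could not parse {movie_id}")
    return movie_id


def run_batch(
    movie_ids: list[str], max_workers: int = 4
) -> tuple[list[str], dict[str, str]]:
    succeeded, failed = [], {}
    # 1. The with-block creates the pool and joins all threads on exit.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # 2. One task is submitted per movie.
        futures = {executor.submit(process_movie, mid): mid for mid in movie_ids}
        # 3. Results are collected as tasks finish; an exception stays in its future.
        for future in as_completed(futures):
            movie_id = futures[future]
            error = future.exception()
            if error is None:
                succeeded.append(future.result())
            else:
                failed[movie_id] = str(error)
    return succeeded, failed
```

Calling run_batch(["tt1", "tt_bad", "tt2"]) reports the two good IDs as succeeded and records the failure message for "tt_bad" separately.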
Configuration Options
Thread Pool Size
Adjust the thread count based on your needs.
Recommendations
| Use Case | Recommended Threads | Rationale |
|---|---|---|
| Development/Testing | 5-10 | Easier debugging, clearer logs |
| Production | 30-50 | Balances throughput against rate-limiting risk |
| High-bandwidth environments | 75-100 | Maximum throughput |
Concurrency Trade-offs
Benefits
- Speed: Dramatically faster scraping
- Efficiency: Better CPU and network utilization
- Scalability: Handles large datasets efficiently
Considerations
- Rate limiting: Too many threads may trigger anti-bot measures
- Memory usage: Each thread consumes memory
- Log readability: Parallel execution creates interleaved logs
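Interleaved logs are easier to follow when every record carries the name of the thread that produced it. The sketch below uses the standard %(threadName)s log attribute together with ThreadPoolExecutor's thread_name_prefix parameter; the logger name, format string, and helper functions are assumptions, not the project's logging setup.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

LOG_FORMAT = "%(asctime)s [%(threadName)s] %(levelname)s %(message)s"


def build_logger(handler: logging.Handler) -> logging.Logger:
    """Return a logger whose records include the emitting thread's name."""
    logger = logging.getLogger("scraper_concurrency_sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.handlers = [handler]
    return logger


def scrape_with_logging(movie_ids, logger, max_workers=4):
    def work(movie_id):
        logger.info("scraping %s", movie_id)
        return movie_id

    # thread_name_prefix yields worker names such as "scraper_0", "scraper_1".
    with ThreadPoolExecutor(
        max_workers=max_workers, thread_name_prefix="scraper"
    ) as executor:
        return list(executor.map(work, movie_ids))
```

Filtering the output by a single thread name then recovers one worker's sequence of events.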
Alternative Approaches
AsyncIO (Not Used)
While asyncio could provide similar benefits, ThreadPoolExecutor was chosen because:
- Simpler implementation for I/O-bound tasks
- Better compatibility with synchronous libraries (requests, psycopg2)
- Easier error handling and debugging
Process-Based Parallelism (Not Used)
multiprocessing.Pool was considered but rejected:
- Higher overhead: Process creation is expensive
- Shared state complexity: Database connections can’t be pickled
- Overkill: Scraping is I/O-bound, not CPU-bound
Example: Adjusting Thread Count
To change the thread pool size, update the thread count setting in shared/config/config.py.
Monitoring Concurrency
The scraper logs concurrent operations.
Next Steps
- Scraping Engine: Learn about the scraping implementation
- Network Evasion: Explore proxy and TOR integration